<!DOCTYPE html>
<html>
    <head>
        <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=1.0,minimum-scale=1.0,user-scalable=no">
        <title>ProteinKG65</title>
        <style type="text/css">
            #none{
                text-decoration: none;
                color: blue;
            }
            #overline{
                text-decoration: overline;
            }
            #line-through{
                text-decoration: line-through;
            }
            .footer{
                font-family: Times;
                font-size: 20px;
                line-height: 1.5;
                word-wrap: break-word;
                box-sizing: border-box;
                border-color: #eaecef;
                border-top: 1px solid #595959;
                color: #586069;
                margin-top: 32px;
                margin-bottom: 20px;
                padding-top: 16px;
                padding-bottom: 40px;
                text-align: right;
                margin-bottom: 0;
            }
            .d1{
                width: 1000px;
                height: 1000px;
                line-height: 40px;
                margin-left: 15%;
                margin-right: 15%;
                font-size: 18px;
                position: center;
                margin: 0 auto;
            }
            .set_table{
                line-height: 30px;
                font-style: italic;
                font-family: Times;
            }
            .add_border_line{
                border-bottom: 1px solid #595959;
                border-top: 1px solid #595959;
            }
            .add_top_border_line{
                border-top: 1px solid #595959;
            }
        </style>
    </head>
    <body>
        <div class="d1">
            <h1>ProteinKG65</h1>
            <hr /><center>
            <img src="../figures/part1.svg" width="800" style="display: block; background-position: 0px 50px;">
            <img src="../figures/part2.svg" width="800" style="display: block;"></center>


            <h2>Introduction</h2>
            <hr />
            <p>ProteinKG65 is a large-scale knowledge graph (KG) dataset in which GO term entities are aligned with textual descriptions and protein entities are aligned with protein sequences. It contains about 614,099 entities and 5,620,437 triplets (5,510,437 Protein-GO triplets and 110,000 GO-GO triplets). We provide both the inductive and the transductive settings used in the <a href="https://arxiv.org/pdf/2207.10080.pdf" id="none">original paper</a>.
            </p>
            <div style="width: 100%; background-color: white;">
                <table width="900" border="0" cellpadding="7" cellspacing="0" bgcolor="#FFFFFF" style="display: inline-block;" class="set_table">
                  <tr>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">Setting</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;&emsp;Split</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#Protein entities</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#GO entities</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#Relations</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#Triplets</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#FFFFFF"><div align="center"></div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;train</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;543,110</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;28,524</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;57</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;4,884,034</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#F6F6F6"><div align="center">Transductive</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;valid</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;25,241</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;5,009</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;44</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;51,243</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#FFFFFF"><div align="center"></div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;test</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;217,463</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;17,908</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;57</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;575,160</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center"></div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;train</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;543,110</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;28,524</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;57</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;4,884,034</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#FFFFFF"><div align="center">Inductive</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;valid</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;855</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;270</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;31</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;2,216</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#F6F6F6"><div align="center"></div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;test</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;3,085</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;1,062</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;50</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;11,127</div></td>
                  </tr>
                </table>
                
            </div>
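            <p>As a quick consistency check on the table, the sketch below sums the triplet counts per setting. The numbers are copied from the table and the introduction; the variable names are illustrative only. Under these figures, the three transductive splits add up to the Protein-GO triplet count given in the introduction, while the inductive splits cover a smaller subset.</p>
            <pre style="line-height: 20px; font-size: 14px;">
```python
# Triplet counts copied from the table above.
SPLITS = {
    "transductive": {"train": 4_884_034, "valid": 51_243, "test": 575_160},
    "inductive": {"train": 4_884_034, "valid": 2_216, "test": 11_127},
}
PROTEIN_GO_TRIPLETS = 5_510_437  # from the introduction
GO_GO_TRIPLETS = 110_000         # from the introduction

# Sum train/valid/test for each setting.
totals = {setting: sum(counts.values()) for setting, counts in SPLITS.items()}

for setting, total in totals.items():
    print(f"{setting}: {total:,} triplets across train/valid/test")
print(f"all triplets: {PROTEIN_GO_TRIPLETS + GO_GO_TRIPLETS:,}")
```
</pre>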
            <p>GO term information is taken from <a href="http://geneontology.org/" id='none'>Gene Ontology</a>. Gene Ontology is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. It consists of a set of GO terms (or concepts) and the relations that hold between them. Its vocabulary of gene and gene product attributes is divided into three categories, covering three aspects of biology:</p>
            <ul style="background-color: #FBEFF2; height: 126px; padding-top: 5px; border-radius: 15px; padding-left: 100px;">
                <li>Molecular Function</li>
                <li>Cellular Component</li>
                <li>Biological Process</li>
            </ul>
            
            <h2>GO term</h2>
            <hr />
            <p>Every GO term in Gene Ontology belongs to one of three sub-ontologies: BPO (Biological Process), MFO (Molecular Function) or CCO (Cellular Component). The relations between proteins and GO terms are concentrated in the following three types: </p>
            <ul style="background-color: #FBEFF2; height: 126px; padding-top: 5px; border-radius: 15px; padding-left: 100px;">
                <li>located in &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&ensp;&emsp;&emsp;&emsp;&emsp;(1,111,771)</li>
                <li>enables &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;(1,946,834)</li>
                <li>involved in &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&ensp;&emsp;&emsp;&emsp;&emsp;(1,720,804)</li>
            </ul>

            <p>Because these few relation types account for most triplets, the relation distribution is severely long-tailed. To mitigate this effect, we refine some of these relations to keep the data balanced. Refinement raises the total number of relation types from 25 to 65.</p>


            <h2>Data</h2>
            <hr />
            <p>
                <a href="https://drive.google.com/file/d/1hwtOz6zvzfMB8h4GgojdZ6ZmQclAVwZS/view?usp=sharing">Download ProteinKG65</a>
            </p>
            <p>
                <a href="https://zenodo.org/record/6517583#.YoRQsJNBzVk">Zenodo</a>
            </p>

            <h2>Details</h2>
            <hr />
            <p>ProteinKG65 follows the Gene Ontology and Gene Annotations releases of April 2020. Each entity (GO term entity or protein entity) is identified by a unique ID.</p>
            <p>Protein-GO triplet example:</p>
            <div style="background-color: #FBEFF2; height: 40px; padding-top: 5px; border-radius: 15px; padding-left: 100px; margin: 20px;">71090(Q14028) &emsp;&emsp;&emsp;&emsp;&emsp; 36(enables_nucleotide_binding) &emsp;&emsp;&emsp;&emsp;&emsp;  117(GO:0000166)</div>
            <p>GO-GO triplet example:</p>
            <div style="background-color: #FBEFF2; height: 40px; padding-top: 5px; border-radius: 15px; padding-left: 100px; margin: 20px;">0(GO:0000001) &emsp;&emsp;&emsp;&emsp;&emsp; 0(is_a) &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;  23558(GO:0048308)</div>
            <p>Protein sequence example:</p>
            <div style="background-color: #FBEFF2; height: 40px; padding-top: 5px; border-radius: 15px; padding-left: 100px; margin: 20px;">P0DQM8: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;QKCCSGGSCPLYFRDRLICPCC</div>
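            <p>The triplet examples above can be read programmatically. The following is a minimal sketch, assuming each line carries three whitespace-separated <code>index(label)</code> fields exactly as displayed on this page. The files in the download may be laid out differently (for instance as bare integer IDs with separate mapping files), so check them before relying on this. <code>parse_triplet</code> and <code>Triplet</code> are hypothetical names and not part of any released code.</p>
            <pre style="line-height: 20px; font-size: 14px;">
```python
import re
from typing import NamedTuple, Tuple

# One field as displayed on this page: an integer index followed by a
# label in parentheses, e.g. "117(GO:0000166)". This layout is an assumption.
FIELD = re.compile(r"(\d+)\(([^()]+)\)")


class Triplet(NamedTuple):
    head: Tuple[int, str]
    relation: Tuple[int, str]
    tail: Tuple[int, str]


def parse_triplet(line: str) -> Triplet:
    """Split 'idx(label) idx(label) idx(label)' into (index, label) pairs."""
    fields = FIELD.findall(line)
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    head, relation, tail = [(int(idx), label) for idx, label in fields]
    return Triplet(head, relation, tail)


# The Protein-GO example shown above.
example = "71090(Q14028) 36(enables_nucleotide_binding) 117(GO:0000166)"
triplet = parse_triplet(example)
print(triplet.head, triplet.relation, triplet.tail)
```
</pre>
            <p>Keeping both the integer index and the readable label in each pair makes it straightforward to feed the indices to an embedding model while still reporting results by UniProt accession or GO identifier.</p>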


            <h2>Publication</h2>
            <hr />
            <ul>
  
                  <li>
                    <b>Multi-modal Protein Knowledge Graph Construction and Applications</b>
                    <br>
                    Siyuan Cheng, Xiaozhuan Liang, Zhen Bi, Huajun Chen, Ningyu Zhang
                    <br>
      
                    
                    <a href="https://arxiv.org/pdf/2207.10080.pdf" style="padding-right: 10px" id="none">arXiv</a>
                    
                  </li>
                  
                </ul>
                <h2>Cite</h2>
                <hr />
                <div style="background-color: #FBEFF2; height: 270px; padding-top: 3px; border-radius: 15px; padding-left: 30px; font-size: 15px; font-style:italic;">
@article{<br>
  cheng2022proteinkg65,<br>
  title={Multi-modal Protein Knowledge Graph Construction and Applications},<br>
  author={Cheng, Siyuan and Liang, Xiaozhuan and Bi, Zhen and  Chen, Huajun and Zhang, Ningyu},<br>
  journal={arXiv preprint arXiv:2207.10080},<br>
  year={2022}<br>
}
                </div>
                <div class="footer">
                    AZFT &amp; ZJUNLP
                </div>


        </div>
        
    </body>
</html>
